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The question of how genetic variation in a population influences phenotypic variation and evolution is of major impor* 
tance in modern biology. Yet much is still unknown about the relative functional importance of different forms of 
genome variation and how they are shaped by evolutionary processes. Here we address these questions by population 
level sequencing of 42 strains from the budding yeast Saccharomyces cerevisiae and its closest relative S. paradoxus. We 
And that genome content variation, in the form of presence or absence as well as copy number of genetic material, is 
higher within S. cerevisiae than within S. paradoxus, despite genetic distances as measured in single-nucleotide polymor- 
phisms being vastly smaller within the former species. This genome content variation, as well as loss-of'function variation 
in the form of premature stop codons and frameshifting indels, is heavily enriched in the subtelomeres, strongly 
reinforcing the relevance of these regions to functional evolution. Genes affected by these likely functional forms of 
variation are enriched for functions mediating interaction with the external environment (sugar transport and metab* 
olism, flocculation, metal transport, and metabolism). Our results and analyses provide a comprehensive view of genomic 
diversity in budding yeast and expose surprising and pronounced differences between the variation within S. cerevisiae 
and that within S. paradoxus. We also believe that the sequence data and de novo assemblies will constitute a useful 
resource for further evolutionary and population genomics studies. 

Key words: population genomics, functional variation, genome evolution, yeast, subtelomeres, loss'of-function variants. 



Abstract 



Introduction 



attractive system in which to address these questions, 
having long served as an important model organism for mole- 
cular biology and genetics but also for comparative genomics 
and the study of genome evolution (Cliften et al. 2003; Kellis 
et al. 2003; Dujon et al. 2004; Dujon 2010; Hittinger 2013). 
Recent years have seen a growing interest in using S. cerevisiae 
and its closest relatives to study natural variation, ecology, and 
population level genetics and genomics (Replansky et al. 2008; 
Liti and Schacherer 2011; Hittinger 2013) as well to exploit 
natural variation in the study of the genetic architecture of 
traits (Liti and Louis 2012). In one of the first population 
genomics studies, we revealed the strong population struc- 
tures of S. cerevisiae and its closest relative S. paradoxus, with 
most variants being private to specific phylogenetic lineages, 
and demonstrated the influence of human activity and 



A central issue in biology is how genetic variation influences 
variation in organismal phenotypes, as well as how this 
variation is shaped by evolutionary processes. So far, the 
emphasis in population genomics studies has been on 
single-nucleotide polymorphisms (SNPs), which are the 
most abundant form of sequence variation and therefore 
the most informative about population history. However, in- 
creasing attention is being devoted to other forms of variation 
including structural, genome content, and copy number var- 
iation (CNV), which might be more likely to have large phe- 
notypic effects (Conrad et al. 2010; Sudmant et al. 2010). Little 
is known, however, about the relative biological importance 
of these various forms of variation in different species. The 
Baker's yeast Saccharomyces cerevisiae has emerged as an 
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domestication on the evolution of the former but not the 
latter species (Liti, Carter, et al. 2009). A number of other 
studies have corroborated and extended these findings in 
various directions (Doniger et al. 2008; Schacherer et al. 
2009; Zhang et al. 2010; Hyma et al. 2011; Warringer et al. 
2011; Dunn et al. 2012; Hyma and Fay 2013), such that a 
coherent framework for understanding the population struc- 
ture and evolutionary history of Baker's yeast is now 
emerging. 

A question of primary interest in yeast population geno- 
mics has been the extent to which the highly stratified genetic 
structure is driven by on the one hand geography and on the 
other hand by ecology and adaptation to different lifestyles 
and environmental niches. So far, the evidence points in favor 
of geography being the most important factor (Liti, Carter, 
et al. 2009; Cromie et al. 2013; Hittinger 2013). Saccharomyces 
paradoxus populations found in different parts of the world 
are highly diverged at the sequence level, do not seem to have 
recently exchanged genetic material, and in some cases dis- 
play partial reproductive isolation (Sniegowski et al. 2002; 
Koufopanou et al. 2006; Liti, Carter, et al. 2009). The genetic 
structure of S. cerevisiae is similarly characterized by strong 
geographical stratification but is nuanced by a larger degree of 
admixture between populations and the presence of strains 
with phylogenetically mosaic genomes (Liti, Carter, et al. 2009; 
Cromie et al. 2013). This admixture has likely been facilitated 
by the association of S. cerevisiae with human fermentation 
practices and deliberate or accidental dispersal of domesti- 
cated strains to different parts of the world (Hyma and Fay 
2013). Overall, genetic divergence is much higher in S. para- 
doxus than in S. cerevisiae, the most highly diverged strains in 
the former species being separated by ~3.5% at the sequence 
level as compared to 0.5-0.8% in the latter (Liti, Carter, et al. 
2009). Surprisingly, however, phenotypic diversity is substan- 
tially higher in S. cerevisiae than in S. paradoxus (Liti, Carter, 
et al. 2009; Warringer et al. 2011). This observation is highly 
unexpected under a model where phenotypic variation is 
primarily the product of gradual accumulation of SNPs 
throughout the genome. Therefore, it raises the hypothesis 
that other evolutionary processes, potentially involving other 
forms of genetic variation, are responsible. 

Several advances have been made in the search for the 
genetic variants that underlie phenotypic variation in yeast, 
primarily in S. cerevisiae. Due to the high degree of population 
stratification, genome-wide association studies are problem- 
atic to carry out in natural yeast populations (Connelly and 
Akey 2012), but studies utilizing recombinant populations 
obtained by crossing diverged parental strains have proven 
fruitful (Liti and Louis 2012). Quantitative trait loci (QTLs) 
have been identified for a range of traits including the ability 
to grow at high temperature (Steinmetz et al. 2002; Sinha et al. 
2008; Parts et al. 2011), resistance to numerous chemical 
compounds (Ehrenreich et al. 2010, 2012), and enological 
traits (Marullo et al. 2007; Ambroset et al. 2011; Salinas 
et al. 2012). Although most genotype-phenotype mapping 
has been performed using the same small set of parent strains 
with a focus on laboratory strains, an increasing number of 
studies are utilizing additional strains that cover more of the 



genetic diversity within the species (Wenger et al. 2010; 
Cubillos et al. 2011; Parts et al. 2011; Ehrenreich et aL 2012). 
Less attention has been devoted to the genetic architecture of 
traits in S. paradoxus, though QTLs for telomere length have 
been identified (Liti, Haricharan, et al. 2009). Litde is known 
about the evolutionary forces that shape the relationship 
between genotype and phenotype in natural yeast popula- 
tions. It has been proposed that the predominance of asexual 
growth and self-fertilization in the natural life history of yeast 
(Ruderfer et al. 2006; Tsai et al. 2008) makes genetic drift the 
dominant force driving phenotypic evolution, such that del- 
eterious variants become fixed in specific lineages following 
repeated population botdenecks (Warringer et al. 201 1; Zorgo 
et al. 2012). However, there are also studies reporting some 
evidence for a role of positive selection (Fraser et al. 2010). 

So far, population genomics analyses in yeast have mostly 
relied on microarray technology (Schacherer et al. 2007, 2009) 
and low-coverage capillary sequencing (Liti, Carter, et al. 
2009). A number of individual strain genomes have also 
been sequenced for various purposes, mostly using capillary 
and pyrosequencing (Wei et al. 2007; Borneman et al. 2008; 
Doniger et al. 2008; Argueso et al. 2009; Novo et al. 2009; 
Dowell et al. 2010; Akao et al. 2011; Borneman et al. 2011; 
Nijkamp et al. 2012; Raiser et al. 2012; Zheng et al. 2012). Due 
to rapid technological advances, next-generation sequencing 
technologies generating short reads have become the most 
cost-effective choice for population-level whole-genome se- 
quencing. Here we apply high-coverage lllumina sequencing 
to 42 natural strains from S. cerevisiae and S. paradoxus. In 
both species, strains were selected from the set of strains 
previously surveyed using low-coverage capillary sequencing 
(Liti, Carter, et al. 2009) and comprehensive phenotypic anal- 
yses (Warringer et al. 2011) to be as informative as possible 
about overall genetic diversity. We construct de novo assem- 
blies and perform analyses to advance our understanding of 
genomic diversity and evolution, funcdonal variation, and the 
genotype-phenotype relationship. In addition to SNPs, this 
data enable analyses of other kinds of variants, such that we 
can begin to address directly the aforementioned questions 
about the nature and evolution of genomic variation and its 
potential phenotypic relevance. We furthermore hope that 
the sequence data and assemblies will constitute a useful 
addition to the growing set of resources available for popu- 
ladon genomics and genotype-phenotype studies in yeast. 
Data and results from the project are accessible at http:// 
www.moseslab.csb.utoronto.ca/sgrp/ (last accessed January 
14, 2014). 

Results and Discussion 

De Novo Yeast Genome Assemblies from Short-Read 
Sequencing Data 

We selected 42 haploid (or homozygous diploid) yeast strains 
and sequenced their genomes to intermediate-to-high 
(10-60x) coverage using paired-end lllumina technology. 
A subset of six strains were sequenced to very high (400-80 
Ox) coverage (table 1). Genomes of 14 S. cerevisiae and 13 S. 
paradoxus strains for which the coverage was higher than 



873 



Bergstrom et al. ■ doi;10.1093/molbev/msu037 



MBE 



s 



o 



E 

3 

z 



O 



3 

a. 

o 

Q. 

3 



m -~ 
M 00 v/^ -JT 
c^ 00 m ro 



ooor^i/^rNi/ivova 



ro »- 00 



I I I I I 



00 I- 1^ 



1- 



i/i rj 1- 



\0 (N c^ 



G^ \0 



(N (N t- 



<N N fO 



I I I I I 



»- 

m in (N 



vn 00 



ro »- 1- 



I- 1- 



I I I I I 



1- fo ro 



1- 



ITt ^ ^ 



00 C^ (N 

t£ t£ 



I- 1^ 
ro -a- 



i/) 1^ \o in 



I I I I I 



\0 Lfl 1— 

ro G^ C\ 



in 1- m 



\a in 



I I 



\o m va 



< < = 

1/1 i/i ^ 
D 3 U 



I 2 S 



< 

3 



o o a .>i 



5 :3 



1 1 

^ = M 
iS 5 u 



5 5 b 



I 



I I 



.1 i 
5 s 



I I 



5 g 



S. g. g. £. i S. 



g 5 5 g 5 5 



in 1- (N «- 



vn \B ro 1^ 
^ rn po 
in m \B 



1- t- in in 



00 1- 



in t- 1— 



I I I I I I I 



cs c\ 



inuirMmmvovDi- 



I I I I I I I 



00 00 1- \o 



00 m o 1- 



I I I I I I I 



V£ \o lO 



>j) I- m o cv (N 

(N ^ C\ m 00 

o in va- ^ 

r*? vjj I/? vj? 1^ 

(N »- 1- 00 

^ VC IS V£ 



mm 
^ in in in 

^ ^ \D ^ 



I I 



I I I I I I I 



c^ \o m in 



00 


1529 






00 




00 
G^ 


2,058/ 


2,047( 


1,153/ 


1,134/ 



I I 



I I I 



)^ ^ ^ ^ ^ 



Jn ^ iii ^ 



^ ^ :^ ^ ^ :^ 



u u u u 



o'o'aaao'o'a 



a a a a a c 



(0 « ro « m *i 

£. S. ^ X 

o o o o o ^ a 

1- i_ i_ ^ C 

3 3 3 3 3 rt 5 

LU LU LU LU LU U. < 



^ m ^ gj U 

o. Q. a. a. 

o o o o u 

i_ l_ k. 1- c 

3 3 3 3 5 

LU LU LU LU < 



^ g. g. 



^ r: ^ ^ 



c c 

3; « S 

CO cfi 

Q >- Q 



S O 

i, 5 



E § R 2 S 

g S 5 G i. 



ui. i C_ lis Vfl wi. fvi w ^ 1^ OO Vfl. ^ OO t f. 



874 



A High-Definition View of Yeast Genome Variation ■ doi;10.1093/molbev/msu037 



MBE 



20x after quality filtering were de novo assembled. The six 
strains with very high coverage were used to explore the effect 
of sequencing coverage on de novo assembly quality by as- 
sembling data subsets systematically downsampled to differ- 
ent coverage levels. We find that assembly quality increases 
with increasing coverage up to but not above a level of about 
120x (supplementary fig. SI, Supplementary Material 
online). As almost all strains were previously sequenced to 
low coverage (1-4x) using Sanger technology (Liti, Carter, 
et al. 2009) and paired-end libraries with long inserts (mean 
insert size 4,480 bp), we could improve our assemblies by 
including this data. Mapping Sanger reads onto the lllumina 
assemblies and performing scaffolding based on paired-end 
information improved assembly connectivity substantially 
with an increase in average scaffold N50 from 64.8 to 
118.1 kb (table 1). 

The 27 de novo assemblies have total sizes between 11.58 
and 1 1.77 Mb, suggesting little variability in yeast genome size. 
This is 3.2-4.8% smaller than the completed S. cerevisiae ref- 
erence genome of 12.16 Mb, indicating a slight underestima- 
tion of true genome sizes, presumably due to collapse in the 
assemblies of transposable elements and perhaps a few other 
repeat regions. Underestimates are compatible with a re- 
ported Ty transposon content of 1-3% in these strains and 
a higher transposon content of 3.5% in the reference strain 
S288c (Liti, Carter, et al. 2009). For the majority of strains, the 
largest scaffolds are longer than the smallest chromosomes of 
the reference genome (table 1). Some scaffolds in the higher 
coverage assemblies correspond to near full-length chromo- 
somes. This illustrates that high-quality assembly of yeast ge- 
nomes is achievable using short-read sequencing data. The 
structurally complex subtelomeric regions did not generally 
assemble well however (with some exceptions; see below), a 
known problem in genome assembly. 

Assennbly Scaffolding by Genetic Linkage Reveals 
Extensive Structural Conservation except 
in Subtelomeres 

Four of the S. cerevisiae strains sequenced, the North Amer- 
ican YPS128, the West African DBVPG6044, the Sake/Japa- 
nese Y12 and the Wine/European DBVPC6765, have been 
used as parents for advanced intercross lines from which 
large numbers of highly recombined offspring genomes 
have been sequenced (Cubillos et al. 2013; lllingworth et al. 
2013). The genetic linkage that manifests between nearby loci 
in these artificial populations was here put to use to further 
scaffold the de novo assemblies of these four strains. This 
resulted in chromosome sized scaffolds for all 16 nuclear 
chromosomes in these strains, containing 94.6-96.1% of 
the total assembly sequence, and scaffold N50 values of 
852-890 kb. 

The linkage-assisted assemblies exposed the near-complete 
colinearity and structural conservation of the genomes of 
four of the major phylogenetic lineages in S. cerevisiae 
(fig. 1A). This is consistent with high rates of meiotic spore 
viability observed in crosses between these strains (Cubillos 
et al. 2011), as large-scale structural variation would impair 



proper segregation of chromosomes, as well as the high 
degree of karyotype conservation within the S. sensu stricto 
species clade (Fischer et al. 2000; Liti et al. 2013). Interestingly, 
the high degree of structural conservation does not extend 
into subtelomeric regions, consistent with their rapid evolu- 
tion and high variability (Liti et al. 2005; Brown et al. 2010). 
Lower assembly quality in these regions makes analysis diffi- 
cult; however, the genetic linkage data allowed the identifica- 
tion of some cases of structural variation in the subtelomeres 
relative to the S. cerevisiae reference genome (fig. IB). We also 
find subtelomeric material that is present in some strains and 
absent in others, for instance, a segment of approximately 
18 kb that localizes to the right subtelomere of chromosome 
XIII and assembled well in the North American and West 
African S. cerevisiae strains (fig. 1C). This genomic region is 
absent from the Wine/European, Sake/Japanese as well as the 
S. cerevisiae reference strain while being present in all S. para- 
doxus strains, constituting an example of a subtelomeric 
region that has recently been lost in certain S. cerevisiae 
strains. These kind of structural differences have implications 
for QTL mapping studies that rely on a reference genome, 
because the causative sequence responsible for a phenotypic 
association might in fact be located in a different subtelomere 
in the mapping strains or simply be absent from the reference 
genome, thus, risking that the search for candidates is di- 
rected to the incorrect genomic region (Cubillos et al. 2011). 

Genome and Gene Content Variation in S. cerevisiae 
Exceeds That in S. paradoxus 

High quality de novo genome assemblies enable the system- 
atic identification of long sequence segments that are present 
in only a subset of yeast strains, such as the region exemplified 
in figure 1C. We made pairwise comparisons between strain 
genomes and summed the total length of sequence regions 
larger than 1 kb that are present in one strain but not the 
other and refer to this as genome content variation. In 
S. cerevisiae this sum is always larger than the number of 
SNPs between a pair of strains for all possible strain compar- 
isons. This is not the case for S. paradoxus, surprisingly, as the 
amount of genome content variation within this species is 
lower than within S. cerevisiae, despite genetic variation in the 
form of SNPs being almost an order of magnitude larger 
(fig. 2A). In both species there is a positive correlation be- 
tween the SNP distance between strains and the amount of 
genome content difference, but the correlation is much 
weaker in S. cerevisiae than in S. paradoxus (r = 0.51 and 
r = 0.97, respectively). The highly variable relationship in 
S. cerevisiae suggests that this kind of variation is the product 
of a different mode of evolution than the clocklike nature of 
SNP accumulation. Nonetheless, clustering strains based on 
genome content differences recapitulates the known popu- 
lation structures of both species (supplementary fig. S2, 
Supplementary Material online). To ensure that the detected 
differences reflects true underlying genome variation, we val- 
idated the presence and absence of 14 variable regions across 
strains by polymerase chain reaction (PCR), confirming the de 
novo assembly predictions in 326/327 cases, as well as 
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Fig. 1. Yeast genome structures revealed by de novo assemblies augmented by genetic linkage data. (A) Scaffolding de novo assemblies using genetic 
linkage information from advanced intercross lines dramatically improves assembly connectivity and reveals extensive structural conservation of the 
core chromosomes in four of the major S. cerevisiae lineages. Displayed is a dot plot of sequence similarity between the assembly scaffolds of the strain 
YPS128 from the North American phylogenetic lineage and the 16 nuclear chromosomes of the S. cerevisiae reference genome (strain S288c), before and 
after the incorporation of the genetic linkage data into the scaffolding process. After scaffolding by genetic linkage, the majority of the assembly 
sequence is contained in 16 large scaffolds that are collinear with the chromosomes of the reference genome. Results are highly similar for the other 
three strains for which genetic linkage data is available; the West African strain DBVPG6044, the Wine/European strain DBVPG6765 and the sake/ 
Japanese strain Y1 2 (the recent sequencing of the sake strain Kyokai no. 7 (Akao et al. 201 1 ) revealed two intrachromosomal inversions in chromosomes 
V and XIV in relation to the reference strain S288c, however these are not shared by the sake strain Y12 sequenced here). Only scaffolds bigger than 
50 kb are displayed. (6) Structural rearrangements relative to the chromosome organization of the S. cerevisiae reference genome, all localized to the 
subtelomeric regions. A directed arrow indicates that a sequence region is aligning to the part of the reference genome where the arrow starts but in the 
de novo assembly is located in the part of the genome corresponding to where the arrow ends. (C) A subtelomeric 18-kb region that assembled well in 
several strains and could be localized by genetic linkage is displayed with coordinates corresponding to the YPS128 chromosome XIII scaffold. Six genes 
were found in this region by ab initio gene prediction (arrows indicate coding direction). 



compared one of our assemblies to an alternative assembly 
for the same strain and found no differences (see supplemen- 
tary fig. S3, Supplementary Material online and Materials and 
Methods). The observation that genome content variation is 
relatively larger within S. cerevisiae than within S. paradoxus is 
highly unexpected under a neutral model of genome 



evolution and implies pronounced differences in the evolu- 
tionary histories of these two species. Although challenging to 
establish at present, we suggest that this unexpected excess of 
genome content variation in S. cerevisiae is likely to be a major 
contributor to the equally surprising excess of phenotypic 
variability within this species. We furthermore note that 



876 



A High-Definition View of Yeast Genome Variation ■ doi;10.1093/molbev/msu037 



MBE 




200 250 300 

Number of SNPs (in thousands) 



B 



S. cerevisiae 



Number of non-reference genes 





DBVPG6765 — 
L1528 
YJ M975 
-DBVPG1373 
-DBVPG1106 

REF I 

W303 I 

Y55 I 

DBVPG6044 ---I 

SKI I 

UWOPS87-2421-I 

UWOPS83-787.3I 



45 



26 



I 35 

34 



Y12 



S. paradoxus 



25 



I 40 
I 40 



I 45 



61 



— 0.001 



•-YPS138 I 
-.N-44 --I 
r S36.7 - I 
-Q59.1 -I 
-Q95.3 - I 
-REF -- I 
-Y7 --- I 
-Z1 -- 
-Y8.5 --I 
-Z1.1 --I 
-T21.4 - I 
rY6.5 —I 

1,W7 I 

"Y9.6 --I 



142 



162 



Fig. 2. Genome content variation within natural yeast populations. (A) The relationship between genetic distance between strains as measured in SNPs 
and the amount of genomic material being present/absent between strains. All pairwise strain comparisons within each of the two species are included. 
(6) The number of nonreference genes found in each strain genome. Strain colors denote subpopulation origin (for S. cerevisiae: green = Wine/European, 
red = West African, cyan = Malaysian, yellow = North American, dark blue = Sake/Japanese, black = mosaic genome; for S. paradoxus: orange = 
American, brown = Far Eastern, magenta = European). The strain trees are neighbor-joining trees based on genome-wide SNP distances and the 
scale bars indicate sequence distance in units of SNPs per basepair (distance scales differ between the species). 



there are considerable amounts of genomic sequence that is 
present in one or more natural strains but absent in the S. 
cerevisiae reference strain S288c (supplementary fig. S4, 
Supplementary Material online). 

We annotated the de novo assemblies by homology and 
synteny comparisons to the reference genome. Out of 5,774 
nondubious open reading frames (ORFs) in the S288c 
S. cerevisiae reference genome, we recovered a median of 
5,417 ORFs (94%) per strain, with most failures to recover a 
gene appearing to be caused by collapse in the assemblies of 
very close paralogs. We also identified genes that are not 
present in the reference genome. The number of nonrefer- 
ence genes present in a S. cerevisiae strain genome ranges 
from four in the lab strain W303, which is very closely related 
to the reference strain S288c, to 61 in the mosaic strain 
UWOPS87-2421, with a median of 31 genes per strain 
(fig. 2B). Using the four S. cerevisiae strains for which genetic 



linkage data allowed assembly of chromosome-sized scaffolds, 
we estimate that at least 75% of these nonreference genes are 
located in the subtelomeric parts of the chromosomes. By 
exploiting distant sequence similarity to functionally anno- 
tated reference genome proteins, we find that the set of 
nonreference genes is enriched for gene ontology terms 
related to flocculation and sugar — particularly maltose — 
transport and metabolism. These results are consistent with 
findings on the evolutionary properties of subtelomeric genes 
across larger evolutionary distances (Brown et al. 2010). 

CNV Is Extensive in Subtelomeres and Greater in 
S. cerevisiae than in S. paradoxus 
Whereas the analysis described in the previous section is only 
powered to detect binary presence and absence due to the 
tendency of similar copies to collapse during de novo 
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assembly, by mapping reads to the reference genome we can 
also assay differences between higher copy numbers. We 
identified CNV across all strains with a mapped coverage of 
8x or higher (18 S. cerevisiae strains and 19 S. paradoxus 
strains). The total size of genomic regions exhibiting CNV 
between strains is three times larger in S. cerevisiae than in 
S. paradoxus (423 kb vs. 142 kb, corresponding to 3.5% and 
1.2% of the total genome size, respectively), despite lower 
levels of overall genetic divergence in the former species. 
Although the true amount of CNV in S. paradoxus could 
be slightly underestimated due to the quality of the reference 
genome being lower than that of S. cerevisiae (containing 
multiple gaps where CNV detection is not possible), this is 
unlikely to explain the large difference observed (see Materials 
and Methods). These results mirror the recent finding of dif- 
ferences in CNV rates between the great ape lineages 
(Sudmant et al. 2013). CNV similarity between strains strongly 
correlates with SNP similarity within both S. cerevisiae and S. 
paradoxus (r = 0.843 and r - 0.885, respectively), and cluster- 
ing strains based on CNV profiles recapitulates the broad 
phylogenetic structures of the two species (supplementary 
fig. S5, Supplementary Material online). In both S. cerevisiae 
and S. paradoxus we find very limited CNV in nonsubtelo- 
meric regions and extensive variation in the subtelomeric 
regions. In S. cerevisiae, 32.0% of subtelomeric nucleotide po- 
sitions are affected by CNV compared with 0.7% in nonsub- 
telomeric regions (42-fold enrichment), and in S. paradoxus 
the corresponding numbers are 9.3% and 0.04%, respectively 
(23-fold enrichment). In S. cerevisiae, genes contained in re- 
gions displaying CNV are enriched for gene ontology terms 
related to sugar transport and metabolism, flocculation and 
ion and metal transport and metabolism. The same trends of 
enrichment are observed in S. paradoxus (although with 
fewer categories reaching statistical significance as a conse- 
quence of the overall lower number of genes affected). Other 
genes with variable copy number between strains in both 
species include the high copy number subtelomeric YRF1 
helicase, PAU seripauperin, and DUP gene families. These re- 
sults imply that largely similar evolutionary forces are shaping 
the landscapes of CNV in these two species. By considering 
aggregate sequencing depth across close paralogs in the ref- 
erence genome, we could unveil further trends not necessarily 
apparent when considering genes one by one. This gene 
family based analysis showed that one clinically isolated 
strain from the Wine/European phylogenetic cluster, 
YJM981, harbors an extremely large number of YRFI gene 
copies; on the order of 1,000 copies, compared with esdmates 
of 10-40 copies in other strains including the reference 
genome. This radical copy number increase is consistent 
with an experimentally demonstrated phenomenon where 
the subtelomeric Y'-element, which contains the YRFl 
genes, is amplified to serve as an alternative mechanism for 
telomere maintenance following failure of the conventional 
telomerase system (Lundblad and Blackburn 1993). This has 
dramadc consequences on subtelomeric structure and overall 
genome size, and we confirmed by pulsed field gel electro- 
phoresis that the chromosomes of YJM981 are much larger 
than normal (supplementary fig. S6, Supplementary Material 



online). Another closely related clinical isolate, YJM978, also 
harbors an usually large number of YRFl copies, though much 
fewer than YJM981 (~150 copies). While we have not identi- 
fied any obvious loss-of-function mutations affecting the key 
telomere maintenance genes in these strains, some of them 
(£571, EST2, R/F2, TEL1, yKu80) appear to have very low ex- 
pression levels in an RNA-seq data set (Skelly et al. 2013). 
These results raise the intriguing possibility that this alterna- 
tive mode of telomere elongation actually occurs in natural 
strains, which to our knowledge has never been 
demonstrated. 

Convergent Evolution of Subtelomeric CNV Underlies 
Natural Arsenic Resistance in S. cerevisiae and 
S. paradoxus 

One region of the genome exhibiting CNV within both 
S. cerevisiae and S. paradoxus is the ARR gene cluster, 
containing three contiguous genes involved in the cellular 
detoxification of arsenic and antimonyl compounds; the tran- 
scription factor ARRl, the arsenate reductase ARR2, and the 
plasma membrane transporter ARR3. We hypothesized that 
CNV in this gene cluster would impact growth phenotypes in 
media containing arsenic compounds and tested this using 
previously collected phenotype data (Warringer et al. 2011). 
In both S. cerevisiae and S. paradoxus, we found a strong 
association between ARR cluster copy number and mitotic 
growth rate, length of mitotic lag phase and mitotic growth 
efficiency in the presence of arsenite (fig. 3A). In an additive 
model, ARR copy number explains 50% (P= 1.5x10^^) and 
92% (P= 1.2x10^^) of the phenotypic variation in growth 
rate in S. cerevisiae and S. paradoxus, respectively, and 71% 
(P = 2.5x10"^) and 82% (P = 5.8x10"^), respectively, of the 
variation in the length of lag time. Interestingly, for the growth 
efficiency phenotype, additivity breaks down as there is no 
difference between strains with one copy and strains with two 
copies. This result is biologically consistent with a model 
where the rate at which the cell can expel arsenic compounds 
increases with the number of ARR copies, but the energy cost 
per arsenite molecule exported is constant and independent 
of copy number above one. 

In S. paradoxus, the distribution of the ARR gene cluster 
CNV tracks the population structure, with all strains from 
the European population having two copies, the North 
American strain having one copy and the two Far Eastern 
strains missing the region. The variant distribution within S. 
cerevisiae shows a more complex pattern with evidence for 
both convergent amplification and introgression between 
lineages (fig. 3B). The region is found in one copy in most 
strains in the species, missing in the Malaysian strain 
UWOPS03-461.4 and the predominantly West Afi-ican 
mosaic strains SKI and Y55, and in two copies in the 
sake strain Y12 and two strains from the Wine/European 
cluster. By computationally phasing the haplotypes of the 
two copies of the Wine/European strain BC187, we found 
that one of these copies clusters phylogenetically with the 
alleles of the other Wine/European strains as expected, 
whereas the other clusters with the Y12 copies, indicating 
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Fig. 3. Convergent evolution of ARR cluster copy number. (A) Growth rate, length of mitotic lag and mitotic growth efficiency in medium containing 
5 mM sodium arsenite oxide for strains with different ARR cluster copy number. Units are on a log2 scale and relative to the S. cerevisiae reference strain 
derivative BY4741. The strain data points are jittered along the horizontal dimension to increase visibility. (B) Distribution of the ARR cluster copy 
number variant within the populations of S. cerevisiae and S. paradoxus. Strain colors denote subpopulation origin as in figure 2. The strain trees are 
neighbor-joining trees based on genome-wide SNP distances, and the scale bars indicates sequence distance in units of SNPs per basepair (distance 
scales differ between the species). (C) The two copies of the ARR gene cluster in the Wine/European strain BC1 87 were computationally phased and the 
sequences of the two copies were clustered with the corresponding sequences from the clean lineage strains of S. cerevisiae using the neighbor-joining 
algorithm. Although the Japanese/Sake strain (Y12) carries two copies, the haplotypes are very similar in sequence and are represented here by a 
consensus version where the few positions that are polymorphic between the two haplotypes have been masked out. The scale bar indicates sequence 
distance in units of SNPs per basepair 



that this copy has been introgressed from the sake lineage 
into parts of the Wine/European population (fig. 3C). The 
other Wine/European strain with a duplication, DBVPG1373, 
however harbors two copies with very low internal sequence 
divergence and with no similarity to the sake copies, implying 
that this is the result of an independent duplication event 
within the Wine/European population. These findings dem- 
onstrate convergent evolution of ARR cluster duplication and 
loss both between different lineages within S. cerevisiae and 
between S. cerevisiae and S. paradoxus. It is tempting to spec- 
ulate that the ARR cluster CNV has been driven by differences 
in environmental arsenic concentrations between the habi- 
tats of different yeast lineages. 



Derived Alleles in Yeast Tend to Be Deleterious and 
Private to a Single Population 

To predict the effects of SNPs on protein function, we em- 
ployed the Sorting Intolerant From Tolerant (SIFT) algorithm, 
which estimates the probability of an amino acid substitution 
being deleterious based on evolutionary conservation of the 
site across a large number of species (Kumar et al. 2009), on all 
nonsynonymous SNPs identified in the 18 S. cerevisiae strains 
with at least 8x coverage mapped to the reference genome. 
Using the S. paradoxus population as outgroup, we also in- 
ferred the most likely ancestral state at each polymorphic 
locus, allowing the polarization of alleles into ancestral and 
derived. Overall, we found a strong tendency for derived 
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Fig. 4. Distribution of SNPs within the S. cerevisiae population. (A) The derived allele frequency spectrum for SNPs with different coding effects. The 
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alleles with higher functional potential to be shifted toward 
low frequencies (fig. 4A). Most derived alleles are present in 
only one (37%) or a few strains, but among nonsynonymous 
alleles the bias is more pronounced with 46% occurring in 
only a single strain. Among nonsynonymous alleles predicted 
to be deleterious the bias toward lower frequencies is even 
stronger with 64% occurring in a single strain. These patterns 
reflect the effects of purifying selection and agree with obser- 
vations made in the human (Abecasis et al. 2012) and 
Ambidopsis thaliana (Cao et al. 2011) populations. Derived, 
and therefore recent, alleles are four times more likely to be 
classified as deleterious than ancestral alleles (fig. 4B), consis- 
tent with elevated negative selection against them in the 
species as a whole. We also specifically considered SNPs 
where the derived allele is private to a single strain and 
found that such alleles present in Wine/European strains 
are more frequently nonsynonymous, and the nonsynony- 
mous SNPs are more frequently predicted deleterious than 
SNPs in other strains (fig. 4C). This suggests relaxed negative 
selection in the Wine/European population, potentially asso- 
ciated with a recent population expansion into a new and 
beneficial niche such as that introduced by humans with the 



emergence and spread of wine production over the last 7,000 
years (Borneman et al. 2013). Running SIFT in the same fash- 
ion on variants identified in the S. paradoxus population, we 
find that derived alleles in this species are on average slightly 
less often deleterious than derived S. cerevisiae alleles (17.8% 
vs. 21.5%), also consistent with recently relaxed selection in 
S. cerevisiae. 

Loss-of-Function Variants in the S. cerevisiae 
Population 

Certain classes of variants are expected to have dramatic 
consequences on gene products and therefore constitute 
particularly interesting candidates for contributing to pheno- 
typic variation. Taking advantage of the well-annotated ref- 
erence genome of S. cerevisiae, we predicted highly probable 
loss-of-function variants in the form of prematurely intro- 
duced stop codons and frameshifting indels across all 19 
S. cerevisiae strains. Overall, these variants are enriched to- 
ward the 3' -end of ORFs (3.7-fold enrichment in the last 5% of 
ORFs, P < 10^^^). This reflects lower purifying selection pres- 
sures against mutations that only perturb translation of the 
very C-terminal end of the protein and is in line with previous 
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Fig. 5. Loss-of-function variants in the S. cerevisiae population. (A) Frequencies of indels and stop-gain SNPs in different categories of genes. Essential 
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observations (Liti, Carter, et al. 2009; Jelier et al. 2011). Within 
ORFs, indels with lengths that are multiples of three 
are highly enriched when compared with noncoding se- 
quence, consistent with strong purifying selection against 
frameshifts (fig. 5A). To test the possibility that full-length 
proteins could be produced from frameshifted genes by 
programmed translational frameshifting, we searched for se- 
quence signatures believed to mediate this process (Theis 
et al. 2008) but found no evidence for this. As a group, 
ORFs classified as dubious do not display the strong signs 
of purifying selection described above, implying that most 
of these are unlikely to constitute functional genes. 

Excluding ORFs classified as dubious and variants posi- 
tioned in the very 3' end of genes (last 2% of the ORF), we 
identified 242 genes harboring loss-of-function variants in at 



least one S. cerevisiae strain. This set is enriched for genes 
classified as functionally uncharacterized (3.1 -fold enrich- 
ment, P < 10^^^), demonstrating that this group of genes 
on average are under lower purifying selection pressures 
(fig. 5A). One explanation for this result could be that there 
is a bias in the investigation, and therefore also in the func- 
tional classification, of yeast genes toward genes that are more 
phenotypically important and therefore also under stronger 
selection. Genes with putative loss-of-function variants are 
also less likely to be essential in the reference strain S288c 
(11-fold depletion, P < 10^^^). Only four essential genes with 
loss-of-function variation were found. In all these cases, the 
variants were positioned either very late in the reading frame 
(PZF1) or very early with an alternative start codon nearby 
(CBF2, LT01), or the affected gene overlapped an essential 
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gene so that the essentiality classification is likely to be false 
{Y]R012C; Ben-Shitrit et al. 2012). 

Subtelomeric genes are overrepresented among genes 
harboring loss-of-function variants (3.5-fold enrichment, 
P < 10^^^), consistent with the idea that the subtelomeres 
are evolutionary dynamic regions for which selection pres- 
sures vary over time and space. Genes that are part of multi- 
copy gene families show a similar higher tendency to harbor 
loss-of-function mutations (fig. 5B). This observation has also 
been made in humans (MacArthur et al. 2012) and is consis- 
tent with a model where these mutations can sometimes 
escape the effects of negative selection due to functional re- 
dundancy with a paralog elsewhere in the genome. However, 
it is also consistent with a model where the kinds of genes 
that tend to have paralogs are simply under lower or more 
variable selection pressures in general (Woods et al. 2013). 
Genes with loss-of-function mutations are enriched for gene 
ontology terms related to transmembrane transport of small 
molecules and ions as well as flocculation. These genes also 
have on average a lower number of gene ontology terms an- 
notated to them (1.89 vs. 2.5 for the whole genome for the 
"biological process" domain, P=1.1x10^^ excluding genes 
with zero terms), indicating a lower degree of pleiotropy. 

We reasoned that if the predicted loss-of-function variants 
are truly disrupting the function of the affected genes, then 
we should see evidence of lower purifying selection acting on 
these genes, in general, within S. cerevisiae. Indeed, genes 
harboring such variants have higher ratios of nonsynonymous 
over synonymous polymorphisms {Tta/n^) within the species 
than genes without loss-of-function variants (0.643 vs. 0.485, 
respectively, P < 10^^^ Fisher's exact test). To test whether 
this relaxed selection pressure is specific to S. cerevisiae or 
reflects a trend that has been running over longer evolution- 
ary timescales, we also evaluated the observed correlation in 
S. paradoxus. Overall, S. paradoxus orthologs of S. cerevisiae 
genes with predicted loss-of-function variants have higher 
TTfl/TTs ratios than other S. paradoxus genes (0.555 vs. 0.457, 
respectively, P < 10^" Fisher's exact test). Thus, the selection 
acting on these genes has been generally low at least over the 
more than 2 billion generations back to the most recent 
common ancestor of S. cerevisiae and S. paradoxus (Dujon 
2010). To test whether the emergence of loss-of-function 
mutations in certain S. cerevisiae strains is associated with a 
further relaxation of selection specifically in these lineages, we 
compared tTo/tTj ratios between those strains carrying the 
loss-of-function variant of a gene and those strains with the in- 
tact ORF and found no difference (0.642 vs. 0.639, P = 0.8774, 
Fisher's exact test). Together, these results are largely consis- 
tent with a nearly neutral model of gene evolution where the 
strength of purifying selection is to a large extent a conserved 
property of the particular gene. 

As a case study for the phenotypic implications of loss-of- 
function variation in natural yeast populations we focused on 
the gene RIM15, in which we identified a frameshifting 2 bp 
insertion private to the Wine/European strain DBVPG6765 
(fig. 5C). We also observed selection against the DBVPG6765 
allele in the genomic region containing RIM 75 during the gen- 
eration of a four-parent advanced intercross line (Cubillos 



et al. 2013). RIM15 encodes a protein kinase believed to be 
involved in the regulation of cell division, proliferation, and 
sporulation in response to nutrient availability (Broach 2012). 
To test whether the identified frameshift variant impairs 
these cellular processes, we constructed diploid hybrids be- 
tween DBVPG6765 and three representatives of other major 
S. cerevisiae lineages. We deleted either of the two RIM15 
copies to obtain reciprocal hemizygote strains that contain 
either only the DBVPG6765 allele or only an allele with an 
intact reading frame. We find that the frameshifted 
DBVPG6765 allele has a massive negative impact on the abil- 
ity of the cell to undergo meiosis and form spores in response 
to nutrient starvation (fig. 5D). Thus, the strain DBVPG6765 
harbors a nonfunctional RIM15 allele despite its strongly det- 
rimental effects on traits considered to be highly related to fit- 
ness. We also find that RIM15 has very low expression level 
in this strain in a RNA-seq data set (Skelly et al. 2013). 
Interestingly, an independent loss-of-function variant in 
RIM15 has been described in sake yeasts, where it is linked 
to a reduced ability to enter quiescence and an associated 
increased rate of ethanol production (Watanabe et al. 2012). 
Further work is needed to elucidate why this strain of 
S. cerevisiae carries this highly deleterious variant. 

Conclusions 

We found surprising and striking differences in the nature of 
genomic diversity within the two yeast species of S. cerevisiae 
and S. paradoxus. Despite lower levels of genetic divergence 
between strains, S. cerevisiae displayed greater diversity in the 
presence and absence as well as copy number of genetic 
material than did S. paradoxus. Our results strongly reinforce 
that the subtelomeres are the major focal regions for func- 
tional evolution; they are almost exclusively the sites for struc- 
tural, gene content and CNV and are also highly enriched for 
loss-of-function variants. The relevance of this subtelomeric 
diversity to phenotypic variation is underscored by the find- 
ing that a third of QTLs for ecologically relevant traits map to 
the subtelomeric regions (Cubillos et al. 2011), even though 
they only constitute approximately 8% of the genome. The 
complex structure of the subtelomeres unfortunately also 
makes them the most problematic regions of the genome 
to assemble and analyze, hampering nucleotide-level dissec- 
tions of this variation and its functional consequences. 
Promising avenues for overcoming this technical limitation 
of short-read sequencing include the subcloning of individual 
subtelomeres, allowing independent sequencing and assem- 
bly, and the use of emerging sequencing technologies that 
produce much longer reads (Bashir et al. 2012; Loomis et al. 
2013). We find systematic trends in the types of genes that 
tend to be affected by certain types of potentially functional 
variation. Genes displaying copy number and loss-of-function 
variation as well as genes not present in the reference genome 
are enriched for functions related to interaction with the 
external environment, for example, sugar transport and me- 
tabolism, flocculation and cell adhesion, and metal transport 
and metabolism. It is plausible that this reflects variation in 
the environmental conditions of different strain habitats, 
leading to selective pressures for these cellular functions 
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that vary across time and space and resulting in either gain or 
loss of gene functions in different lineages. The strong popu- 
lation structure of natural yeast strains has a critical influence 
on virtually all aspects of genomic diversity, and we stress the 
importance of considering this in analysis and interpretation 
of results. Finally, our results demonstrate the high utility of 
short-read next-generation sequencing for yeast population 
genomics, especially the value of de novo assembly to identify 
variation in genome content, and we hope that the sequence 
data and assemblies presented will be useful to the yeast 
community as well as the broader evolutionary and popula- 
tion genetics and genomics communities. 

Materials and Methods 

Genome Sequencing and De Novo Assembly 
Strains were selected for whole-genome sequencing to 
encompass most of the genetic diversity of S. cerevisiae and 
S. paradoxus; at least one strain representing each major phy- 
logenetic lineage as defined in Liti, Carter, et al. (2009) (in 
S. cerevisiae, the North American, Sake/Japanese, West 
African, Wine/European, and Malaysian lineages; in S. para- 
doxus, the far Eastern, American, and European lineages) as 
well as a larger number of strains from the European popu- 
lations of both species. Additionally, in S. cerevisiae, we 
selected the mosaic strains W303, SKI, and Y55, which are 
popular laboratory strains as well as two wild mosaic strains 
isolated from the Bahamas and Hawaii, that phylogenetically 
cluster closer to the non-European lineages than most other 
mosaic strains. 

DNA was extracted using the phenol chloroform protocol. 
Samples were sequenced using the lllumina GAM and HiSeq 
platforms, with 2x 108 bp or 2 x 100 bp paired end libraries 
prepared as previously described (Parts et al. 2011). SGA 
(Simpson and Durbin 2012) was used to perform de novo 
assembly of all strains that had sequencing coverage of 20x 
or higher after quality filtering (by the SGA preprocess pro- 
gram), including scaffolding of contigs using lllumina paired- 
end information. Reads were error corrected with a k-mer 
length of 41 and a minimum of five read pairs were required 
to link two contigs into a scaffold. Contigs/scaffolds smaller 
than 200 bp were discarded. Further scaffolding was then 
performed using low-coverage Sanger capillary data previ- 
ously produced for the same strains (Liti, Carter, et al. 
2009). The paired-end Sanger reads were mapped onto the 
SGA scaffolds using SSAHA2 (Ning et al. 2001), and scaffold 
pairs connected by at least two read pairs in which both reads 
had a mapping quality of 254 were used as input for the 
stand-alone scaffolder SSPACE (Boetzer et al. 2011), which 
was run without contig extension and a maximum allowed 
deviation from the mean pair distance of 50%. Mean pair 
distances were estimated for each strain individually from 
pairs where both reads mapped within the same SGA scaffold. 
To avoid incorrect scaffolding because of collapsed repeats in 
assemblies, scaffold ends showing signs of collapse in the form 
of higher lllumina coverage were excluded from Sanger scaf- 
folding. The lllumina reads were mapped to the SGA assem- 
blies using BWA 0.6.1 (Li and Durbin 2009) with the "-q 10" 



parameter and scaffold edges where the logj-ratio between 
the average coverage of the outermost 5 kb and the genome- 
wide median coverage was higher than 0.5 were excluded. 
After scaffolding with the Sanger reads, the SGA gapfill pro- 
gram was run to fill in some of the resulting gaps using the 
error-corrected lllumina reads. For four of the S. paradoxus 
strains assembled (Y8.5, Y9.6, Z1, and W7), no Sanger reads 
are available, and so further scaffolding could not be per- 
formed for these strains. Genome sizes reported in the text 
do not take the large ribosomal DNA (rDNA) tandem repeat 
array on chromosome XII into account, which in the S288c 
reference genome is represented by two copies (Johnston 
et al. 1997) and which collapses into a single copy in our de 
novo assemblies. We also note that the mitochondrial ge- 
nomes did not assemble well, likely a result of the very high 
AT content. 

Contaminant sequences displaying high (~99%) identity to 
genomes from the prokaryotic Stapliylococcus genus were 
identified in the assembly of the strain DBVPG6044. All con- 
tigs/scaffolds that had a BlastN match to a Staphylococcus 
species in the top five hits from the NCBI nr database were 
removed from the assembly. No hybrid contigs or scaffolds 
containing both Saccharomyces and Staphylococcus were 
found. 

Assembly Scaffolding Using Genetic Linkage Data 
Four of the S. cerevisiae strains have been used as parents for 
advanced intercross lines as part of projects to study complex 
traits and recombination, and the genomes of a large number 
of segregants from these lines have been sequenced: 192 F12 
segregants from a two-parent cross between YPS128 and 
DBVPG6044 (lllingworth et al. 2013) and 192 F12 segregants 
fi-om a four-parent cross between YPS128, DBVPG6044, Y12, 
and DBVPG6765 (Cubillos et al. 2013); in both cases se- 
quenced by paired-end lllumina technology with 100-bp 
read lengths. The patterns of linkage disequilibrium (LD) 
between SNPs in these artificial populations were used to 
further scaffold the de novo assemblies of the four parental 
strains. First, for each of the parental genome assemblies, the 
lllumina reads for the other parental strains were mapped to 
the assembly and SNPs were called (read mapping and SNP 
calling performed as described in the section "Read-mapping 
and variant calling"). For the two strains of the two-parent 
cross (YPS128 and DBVPG6044), only data from this cross was 
used, and data from the four-parent cross was used only for 
the remaining two strains (Y12 and DBVPG6765). The segre- 
gant individuals were then genotyped at the resulting lists of 
SNPs using samtools mpileup 0.1.18 (Li et al. 2009), assigning 
the corresponding allele if the log-likelihood ratio between 
the homozygous states of the two alleles was bigger than 10 
or smaller than —10, assigning unknown genotype if in- 
between. Segregant individuals showing signs of diploidy or 
DNA contamination as assessed by genome-wide patterns of 
heterozygosity were excluded (20 out of 192 individuals in the 
two-parent cross and 15 out of 192 in the four-parent cross). 
Using the resulting genotype calls, LD in units of was then 
computed between all pairs of SNPs using PLINK (Purcell et al. 
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2007). A model for the approximation of physical base-pair 
distances from genetic distances of the form = e^^'^, where 
d is physical distance between two SNPs in units of base pairs 
and 1 '\s 3l constant, was fit by nonlinear least squares to the 
set of values from SNPs located within the same scaffold (only 
scaffolds of size 50 kb and bigger were used for fitting). LD 
between SNPs on different scaffolds was used to construct 
pairwise scaffold-scaffold links, with relative orientations in- 
ferred by comparing the strength of LD in the four corner 
elements of the matrix of LD values between all SNPs in the 
two scaffolds. The pairwise scaffold-scaffold links then con- 
stitutes a directed graph through which each linkage group 
should be traceable as a simple path. The paths were traced 
partly using the scaffolding algorithm of SCA and partly 
by manual curation aided by visualizations in Cytoscape 
(Shannon et al. 2003). Scaffolds that could be positioned 
but not reliably oriented because of a lack of difference in 
LD profiles between the two ends or because they did not 
have at least two SNPs were not included in the linkage group 
scaffolds but left as unplaced. 

Genome Assembly Validation 

Diagnostic PCR was used to confirm the absence and pres- 
ence of genomic regions identified as variable between strains 
in the de novo assemblies. Primers were designed to target 14 
different regions in all 14 strains of S. cerevisiae with de novo 
assemblies plus the S288c reference strain and for 9 of the 
regions also in all of the 13 S. paradoxus strains with de novo 
assemblies (supplementary fig. S3, Supplementary Material 
online). Different primers were designed for the two different 
species and placed in nonpolymorphic locations. Primers 
are listed in supplementary table SI, Supplementary 
Material online. 

As an additional validation, the de novo genome assembly 
of the S. cerevisiae strain SKI was compared with another 
recently released assembly of this strain (van Overbeek 
et al. http://cbio.mskcc.org/public/SK1_MvO/ [last accessed 
October 4, 2013]; Sasaki et al. 2013), produced primarily from 
454 pyrosequencing reads and with additional error-correc- 
tion and gap closing efforts. The assemblies were compared to 
identify any false negatives in the sense of sequence incor- 
rectly left out of our assembly and false positives in the sense 
of sequence or assembly artifacts in our assembly that do not 
represent actual sequence present in the genome of this 
strain. A single region larger than 1 kb was found to be present 
in the van Overbeek et al. assembly and absent in ours. 
Inspection revealed that this is a technical cloning vector 
that has not been introduced into the SKI derivative se- 
quenced here, and thus does not represent genuinely missing 
biological sequence. Thirty-one regions larger than 1 kb were 
present in our assembly and absent in the van Overbeek et al. 
assembly. To test whether these are assembly artifacts in our 
assembly or sequence incorrectly left out of the van Overbeek 
et al. assembly, the 454 reads used to construct the van 
Overbeek et al. assembly were mapped back to our SKI as- 
sembly using SSAHA2 (Ning et al. 2001), and the depth of 
coverage was assayed in these 31 regions. Excluding the 



100 bp at the edge of contigs, all positions in these regions 
were covered by five or more reads with mapping qualities of 
at least 75. This shows that these sequences are actually pre- 
sent in the strain SKI but have for some technical reason been 
left out of the van Overbeek et al. assembly. No false positives 
and no false negatives of this kind were thus identified in our 
SKI de novo assembly. 

Reference Genomes and Annotation Data Used 
For S. cerevisiae, the S288c reference genome and correspond- 
ing annotation files (Release R64-1-1, downloaded from the 
Saccharomyces Genome Database [http://yeastgenome.org/, 
last accessed January 28, 2014] on February 5, 201 1) was used 
for reference-based analyses. The sequence of the 2-).im circle 
plasmid was downloaded from NCBI's GenBank (accession 
NC_001 398). For S. paradoxus, the CBS432 reference genome 
from Liti, Carter, et al. (2009) was used, with annotation being 
obtained from Scannell et al. (2011). A S. paradoxus CBS432 
mitochondrial sequence was obtained from Prochazka et al. 
(2012) (GenBank accession: JQ862335.1). For analyses assaying 
the presence of genomic material in the S. paradoxus refer- 
ence genome, an alternative version of the CBS432 assembly 
that includes some additional sequence that was incorrectly 
left out from the original assembly was used (available from 
Liti, Carter, et al. [2009]). We define the subtelomeric regions 
as the outermost 33 kb of each reference chromosome, fol- 
lowing Brown et al. (2010). Annotation on the essentiality of 
genes was that of the S. cerevisiae reference strain S288c. 

Identification of Genome Content Differences 
To identify genomic material present in a given genome but 
absent in another, alignments were constructed between 
genome assemblies using BlastN without low-complexity fil- 
tering, and a region in the query sequence was called as not 
present in the target sequence if no alignment of either length 
100 bp and sequence identity of 75% or length 50 bp and 
sequence identity 90% was found. For the reported results 
only such identified regions of a minimum length of 1,000 bp 
were considered. 

Genome Assembly Annotation 

The protein sequences of all nuclear ORFs annotated as gen- 
uine genes in the reference genomes of S. cerevisiae and 
S. paradoxus, respectively (in the former species, all ORFs 
not classified as "dubious" and in the latter all ORFs classified 
as "real") were searched for in the de novo assemblies using 
exonerate version 2.31.12 (Slater and Birney 2005) with the 
"protein2dna" alignment model. For each reference protein 
query, the most likely homologous gene in each strain 
genome was inferred by sequence similarity and synteny com- 
parisons from the set of all candidate alignments with more 
than 90% sequence similarity to the query and with an exon- 
erate alignment score within 90% of the top scoring align- 
ment for the gene. This was done by identifying and selecting 
in priority order top-scoring alignments supported by gene 
synteny to the reference genome on both sides, top-scoring 
alignments supported by synteny on one side and being right 
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next to the edge of the scaffold on the other side, and nontop 
scoring alignments supported by synteny on both sides. If 
more than one reference protein query had their inferred 
homolog localized to the same part of the assembly (both 
start and end positions differing by less than 100 bp), only one 
of them was kept (prioritizing perfect synteny support com- 
bined with being the top scoring alignment, then higher align- 
ment score, and then longer gene length). 

Ab initio gene prediction for genes not present in the 
reference genome was performed using GeneMarkS version 
4.10d (Besemer et al. 2001) in the self-training mode and with 
the "— euk" option for intronless eukaryotic gene prediction. 
Predicted genes that did not overlap the coordinates of the 
inferred reference genome homologs and that additionally, to 
account for potential reference homologs missed by the 
above homology inference, did not display high similarity to 
any reference genome gene (defined as the presence of a 
BlastN hit covering either 80% of the query length or 
200 bp and with a sequence similarity of at least 90%) were 
classified as nonreference genes. For the numbers reported, 
only predicted genes with lengths of 300 bp or more were 
included. 

Read-Mapping and Variant Calling 
For purposes of identification of SNPs and short indels, reads 
cleaned from adapter contamination using cutadapt (http:// 
code.google.eom/p/cutadapt/, last accessed January 11, 2011) 
were mapped to reference genomes and de novo assemblies 
using Stampy 1.0.18 (Lunter and Goodson 2011) with the 
"sensitive" parameter, in hybrid mode with BWA 0.5.9 and 
with the "— q 10" BWA parameter. Nonprimary alignments 
and nonproperly paired reads were filtered out and duplicate 
reads were removed using Picard (http://picard.sourceforge. 
net/, last accessed October 22, 2012). Before SNP calling reads 
were locally realigned using SRMA 0.1.15 (Homer and Nelson 
2010) and read base qualities were capped by their Base 
Alignment Qualities (Li 2011) as computed by samtools 
0.1.18 (Li et al. 2009). SNPs were called on the read alignments 
using FreeBayes 0.9.5 (http://bioinformatics.bc.edu/marthlab/ 
wiki/index.php/FreeBayes, last accessed May 2, 2012) set for 
haploid samples. Short indels were identified using Dindel 
1.01 (Albers et al. 2011) executed in pooled mode. 
Individual indel genotype calls for each haploid strain were 
obtained by extracting the genotype likelihoods as computed 
assuming diploid samples and assigning the corresponding 
allele if the log-likelihood ratio between the homozygous 
states of the two alleles was bigger than 5.3 or smaller than 
—5.3, assigning unknown genotype if in-between. Only indels 
shorter than 60 bp were called. Indels were not counted as 
affecting the ORF of a gene if the equivalent indel region 
(Krawitz et al. 2010) was not completely contained within 
the ORF. 

Copy Number Variation 

CNV was identified by mapping the lllumina reads to the 
S. cerevisiae and S. paradoxus reference genomes. The average 
depth of read coverage was computed in nonoverlapping 



windows of size 500 bp and normalized by the genome- 
wide median coverage for each strain, and the log2 values 
of these ratios were then plotted. Regions of the genome 
showing coverage variation between strains were identified 
manually by systematically inspecting the plots. Regions an- 
notated with transposable element associated features were 
masked out at the plotting level and excluded from the anal- 
ysis. As the S. paradoxus reference genome is not comprehen- 
sively annotated for transposable elements, masking was 
applied to regions of the genome displaying high sequence 
similarity to any S. cerevisiae transposon-associated feature 
sequences (BlastN e-value <10^^^). One strain of 
S. cerevisiae (YJM981) and four strains of S. paradoxus 
(KPN3829, Q31.4, Q32.3, UFRJ50816) were excluded fi-om 
copy number analysis because they had a coverage of reads 
mapped to the reference genome of below 8x. For the gene 
family-based CNV analysis, read depth was aggregated across 
all members of a gene family and normalized by the genome- 
wide median coverage as above. The gene families used are 
those defined in Christiaens et al. (2012). Although the true 
amount of CNV in S. paradoxus is likely slightly underesti- 
mated due to the presence of gaps in the reference genomes, 
this underestimation is unlikely to explain the large difference 
in the amount of CNV observed between 5. cerevisiae and 
S. paradoxus. 9% of subtelomeric sequence (and 1.5% of the 
whole genome) in the S. paradoxus reference assembly con- 
sists of gaps — assuming very conservatively that all of this 
gapped subtelomeric sequence harbors CNV the estimate 
for the total size of genomic regions containing CNVs 
would increase from 142 to 232 kb (and to 319 kb extending 
the assumption to all gapped sequence in the whole 
genome), which is still considerably smaller than the 423 kb 
in S. cerevisiae. 

ARR Gene Cluster Analysis 

The haplotypes of the two ARR gene copy clusters in the 
strain BC187 were phased by first calling SNPs in a 6,400-bp 
region encompassing the cluster using samtools mpileup on 
mapped lllumina reads, and then phasing the SNPs using the 
ReadBackedPhasing program from the Genome Analysis 
Toolkit (McKenna et al. 2010) with both lllumina and 
Sanger paired-end reads as input. For the strains Y12 and 
DBVPG1373, the two ARR cluster copies were not sufficiently 
diverged for phasing to be possible. Neighbor-joining trees 
were constructed by Seaview (Gouy et al. 2010), and for 
Y12 a single consensus sequence for the two haplotypes 
was constructed for phylogenetic analysis by excluding the 
positions where the two copies differ in sequence (five posi- 
tions). Data from all the strains on the mitotic growth rate, 
length of mitotic lag, and mitotic growth efficiency in 
a medium containing 5 mM sodium arsenite oxide was 
obtained from Warringer et al. (2011). 

Gene Ontology Enrichment 

Gene ontology enrichment analyses were performed at 
YeastMine (http://yeastmine.yeastgenome.org/, last accessed 
February 6, 2013) with the Benjamini-Hochberg correction 
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for multiple testing and a P-value threshold of 0.05. As gene 
ontology annotation is not directly available for S. paradoxus, 
genes were mapped to their S. cerevisiae orthologs and anal- 
yses were performed on the resulting S. cerevisiae gene sets. 
Ortholog mappings were obtained from Scannell et al. (201 1). 

R/A/115 Phenotyping 

R/A/115 reciprocal hemizygosity strains were constructed 
by one-step PCR deletion with URA3 as a selectable 
marker (Salinas et al. 2012). The gene was deleted in the 
haploid versions of the four parental strains (either Mat a, 
ho::HygMX, ura3::KanMX or Mat a, hor.NatMX, ura3::KanMX) 
and deletions were confirmed by PCR. Strains of opposite 
mating type were crossed to generate the hemizygotic 
hybrid diploid strains. Sporulation efficiency was then mea- 
sured as follows. Strains were grown in 50 ml of YPG (2% 
peptone, 1% yeast extract, 3% glycerol, and 0.1% glucose) 
overnight, washed with water three times, transferred to 
50 ml of 2% potassium acetate and incubated with shaking 
at 23 °C. Samples were taken at 1, 2, 4, and 7 days after the 
start of incubation, and in each sample, the number of cells 
having formed asci was counted using an optical microscope. 
Two-hundred cells were assayed for each sample. 

Supplementary Material 

Supplementary figures S1-S6 and tables SI are available at 
Molecular Biology and Evolution online (http://www.mbe. 
oxfordjournals.org/). 
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